Reliable Information Storage in 
Memories Designed from 
Unreliable Components* 

By MICHAEL G. TAYLOR 

(Manuscript received April 10, 1988) 

This is the first of two papers which consider the theoretical capabilities 
of computing systems designed from unreliable components. This paper 
discusses the capabilities of memories; the second paper discusses the 
capabilities of entire computing systems. Both present existence theorems 
analogous to the existence theorems of information theory. The fundamental 
result of information theory is that communication channels have a capacity, 
C, such that for all information rates less than C, arbitrarily reliable 
communication can be achieved. In analogy with this result, it is shown that 
each type of memory has an information storage capacity, C, such that for
all memory redundancies greater than 1/C arbitrarily reliable information 
storage can be achieved. Since memory components malfunction in many 
different ways, two representative models for component malfunctions are 
considered. The first is based on the assumption that malfunctions of a 
particular component are statistically independent from one use to another. 
The second is based on the assumption that components fail permanently 
but that bad components are periodically replaced with good ones. In both 
cases, malfunctions in different components are assumed to be independent. 
For both models it is shown that there exist memories, constructed entirely 
from unreliable components of the assumed type, which have nonzero 
information storage capacities. 

I. INTRODUCTION 

The problem of designing systems which operate reliably even 
though their components are unreliable has been formulated in many 
different ways. In a typical formulation, one considers some particular 
system which performs a computation with a nonzero probability of
error. The problem is to design some other "reliable" system which 
performs the same computation using the same types of components 
but with a smaller probability of error. In fact, the ultimate objec- 
tive is to show that it is possible to design systems, using only 
unreliable components, which perform computations with an arbi- 
trarily small probability of error. Unfortunately, there is no standard 
terminology for describing these systems; therefore, the following 
section introduces the terminology to be used throughout this paper. 

1.1 Definitions 

The computations performed by the computing systems are de- 
scribed in terms of elementary operations where an elementary opera- 
tion is any Boolean function of two binary operands. There are 
sixteen different elementary operations, each one of which can be 
represented by a binary matrix of the type shown in Fig. 1. Typical 
elementary operations are and, or, and modulo-2 addition. The com- 
puting systems to be considered are constructed from components 
which are devices that either perform one elementary operation or 
store one binary digit. The complexity of a system is defined to be
equal to the number of components within the system. 
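
As a small illustration of the definition above, the following sketch enumerates every way of filling a 2 x 2 binary matrix and treats each filling as one elementary operation. The names `elementary_operations` and `apply_operation`, and the convention that the row index is the first operand, are assumptions made for this example only.

```python
from itertools import product

def elementary_operations():
    """Return all 2x2 binary matrices; entry [a][b] is the output for operands a, b."""
    return [((w, x), (y, z)) for w, x, y, z in product((0, 1), repeat=4)]

def apply_operation(matrix, a, b):
    """Evaluate an elementary operation, given as a matrix, on two binary operands."""
    return matrix[a][b]

# Three of the typical operations named in the text, written as matrices.
AND = ((0, 0), (0, 1))
OR = ((0, 1), (1, 1))
XOR = ((0, 1), (1, 0))  # modulo-2 addition

operations = elementary_operations()
print(len(operations), "elementary operations")
for name, matrix in (("and", AND), ("or", OR), ("mod-2 sum", XOR)):
    print(name, matrix, matrix in operations)
```

Each matrix has four entries and each entry is 0 or 1, which is where the count of sixteen comes from.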

Fig. 1 - Binary matrix for the and operation. There are 2^4 = 16 ways of filling
this table, each one of which describes one of the 16 allowed Boolean functions.

In an irredundant computing system, the amount of computation
performed by the system equals the number of elementary operations
which are executed. Corresponding to each irredundant computing
system, there are many redundant computing systems which perform
equivalent computations. These redundant systems are more complex 
than the equivalent irredundant one but, hopefully, they are also 
more reliable. The amount of computation performed by any one of 
these redundant computing systems is defined to be equal to the
amount of computation performed by the corresponding irredundant 
system. Finally, the redundancy of a computing system equals the 
ratio of the complexity of the system to the amount of computation 
performed by the system. 

To illustrate the use of these terms, consider a system that computes,
in parallel, the modulo-2 sums of the digits in two k-digit sequences.
The system first encodes each sequence of digits into a code
word from an (n, k)* group code, then forms the modulo-2 sums
of the digits in these code words, and finally decodes the result. The
amount of computation equals k since an equivalent irredundant
computer would simply perform k elementary operations each consisting
of a modulo-2 sum. If the complexities of the encoder and decoder
within the redundant system are C_E and C_D, respectively, the complexity
of the entire system equals C_E + C_D + n, where the last term
arises from the n modulo-2 adders required to perform the desired
operation. The redundancy of this system equals (C_E + C_D + n)/k.
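
The arithmetic of this example can be sketched in a few lines. The function name and the numerical values of the encoder and decoder complexities below are hypothetical; only the ratio (C_E + C_D + n)/k comes from the text.

```python
def redundancy(encoder_complexity, decoder_complexity, n, k):
    """Redundancy of the coded modulo-2 adder: total complexity over computation k."""
    if k <= 0:
        raise ValueError("k must be positive")
    complexity = encoder_complexity + decoder_complexity + n  # n modulo-2 adders
    return complexity / k

# Hypothetical figures: an (n, k) = (16, 8) code with C_E = 40 and C_D = 120.
print(redundancy(40, 120, n=16, k=8))
```

An irredundant adder corresponds to no encoder, no decoder, and n = k, for which the function returns 1.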

1.2 Historical Background 

Von Neumann was one of the first to propose a system which uses 
redundancy to gain reliability. 1 He considered systems consisting of 
interconnections of identical elements† where all the elements compute
either the majority function or the Sheffer-stroke function. The form 
(network topology) of the redundancy network is similar to that of 
the original irredundant network, the precursor. Specifically, each 
element in the precursor is replaced by a set of 3n elements of the 
same type in the redundant network (redundancy = 3n), and each 
interconnection is replaced by a bundle of n interconnections. The 3n
elements in each set are interconnected in such a way that there are
n outputs. It is assumed that a malfunction occurs in a particular set
of elements whenever more than a certain fraction, δ, of the n outputs
are in error; δ is chosen to minimize the probability of a malfunction
within the entire system. Von Neumann showed that, for large n, the 



* n is the length of each code word; k is the number of information digits in
each word. 

† The terms "element" and "network element" are used to indicate devices
consisting of several components (some finite number). 
probability that one set of 3n Sheffer-stroke elements malfunctions on
one particular use is

$$\Pr(\text{malfunction in one set of elements}) \approx \frac{6.4}{\sqrt{n}} \cdot 10^{-8.6n/10{,}000}$$

where it is assumed that the probability of error for each use of each
element is 5 × 10^-3. Therefore, for this system, the probability of
malfunction decreases exponentially with the redundancy, provided that
the redundancy is sufficiently large. Other approaches involving the 
use of more complex modules have led to results similar to those 
of von Neumann. 2 In some cases the resulting network is more ef- 
ficient (less redundant for a given probability of system failure) 
than von Neumann's network. However, in all cases, to achieve an 
arbitrarily small probability of system failure it is necessary to make 
the redundancy arbitrarily large. 
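
To give a feel for how large the redundancy must be, the following sketch evaluates von Neumann's large-n estimate, read here as (6.4/sqrt(n)) * 10^(-8.6n/10,000) for an element error probability of 5 × 10^-3. The function name and the sample bundle sizes are choices made for this illustration.

```python
import math

def von_neumann_malfunction_probability(n):
    """Large-n estimate of the malfunction probability of one set of 3n elements."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 6.4 / math.sqrt(n) * 10 ** (-8.6 * n / 10_000)

# The redundancy of the scheme is 3n, so each row costs three times its n.
for n in (1_000, 2_000, 5_000, 10_000, 20_000):
    print(f"n = {n:6d}  redundancy = {3 * n:6d}  "
          f"Pr(malfunction) ~ {von_neumann_malfunction_probability(n):.2e}")
```

The estimate falls exponentially in n, but only because the redundancy 3n grows without bound, which is the point made in the paragraph above.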

It is interesting to compare von Neumann's results with those 
obtained by Shannon concerning the reliability of communication 
systems. 3 Both show that the probability of error within the system 
can be made arbitrarily small. In the case of communication systems, 
this can be achieved for certain nonzero information rates by choosing 
the constraint length of the code arbitrarily large. The largest infor- 
mation rate for which the probability of error can be made arbitrarily 
small is called the capacity of the communication channel. By making 
an analogy with this result, one might expect that, in the case of a 
computing system, it should be possible to achieve an arbitrarily 
small probability of error for certain bounded values of the redun- 
dancy by choosing the complexity sufficiently large. Extending this 
analogy, the reciprocal of the minimum redundancy for which the 
probability of error can be made arbitrarily small is called the com- 
puting capacity of the computing system. For the systems proposed 
by von Neumann, there is no finite redundancy for which the probability
of error can be made arbitrarily small; therefore, these systems
have a computing capacity of zero. 

The question remained whether there existed any method for 
designing a "reliable" system with a nonzero computing capacity. 
Since it was well known that by using suitable coders and decoders 
it is possible to obtain a "reliable" communication system, it was 
natural to attempt to apply coding techniques to the problem of 
designing a "reliable" computing system. One approach is to con- 
sider coding the inputs to each computing component and decoding 
the output. Elias set out to show that this method could not be used 
to design a general computing system with a nonzero computing 
capacity. 4 Since Elias desired a negative result, he permitted all
coders and decoders to be noiseless, and since a general computing 
system must be capable of performing all 16 elementary operations, 
Elias considered 16 types of computing components, each capable 
of performing one of the elementary operations. 

The computing components were divided into two classes. The 
operations performed by components in the first class were those 
represented by matrices containing an odd number of ones and zeros 
and in the second class were those containing an even number. Elias 
showed that components in the first class have a computing capacity 
equal to zero; but that it is possible for components in the second 
class to have a nonzero computing capacity.* 

Unfortunately, the only component in the second class which per- 
forms a nontrivial operation is the modulo-2 adder; furthermore, there 
is no combination of class two components that can perform class 
one operations such as and and or. Therefore Elias concluded that 
this coding technique could not be used to design a general computing 
system with a nonzero computing capacity. 
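
The two classes can be checked by direct enumeration. The sketch below sorts the sixteen operations by the parity of the number of ones in their matrices and tests each for being affine over GF(2), that is, of the form c0 + c1·a + c2·b (mod 2). One way to see why class two components cannot be combined into and or or is that compositions of affine functions remain affine. The helper names are assumptions of this example, not Elias's notation.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def all_operations():
    """Each operation is a tuple of outputs, one per entry of INPUTS."""
    return list(product((0, 1), repeat=4))

def is_affine(table):
    """True if table equals c0 ^ (c1 & a) ^ (c2 & b) for some binary c0, c1, c2."""
    for c0, c1, c2 in product((0, 1), repeat=3):
        if all(table[i] == c0 ^ (c1 & a) ^ (c2 & b)
               for i, (a, b) in enumerate(INPUTS)):
            return True
    return False

# First class: odd number of ones. Second class: even number of ones.
odd_class = [t for t in all_operations() if sum(t) % 2 == 1]
even_class = [t for t in all_operations() if sum(t) % 2 == 0]

AND = tuple(a & b for a, b in INPUTS)
OR = tuple(a | b for a, b in INPUTS)
XOR = tuple(a ^ b for a, b in INPUTS)

print("first class size:", len(odd_class), " second class size:", len(even_class))
print("second class all affine:", all(is_affine(t) for t in even_class))
print("first class any affine:", any(is_affine(t) for t in odd_class))
```

Apart from the modulo-2 sum and its complement, the second class contains only constants and functions that copy or invert a single operand.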

More recently Winograd and Cowan proposed another scheme for 
designing a "reliable" computing system. 8 Their approach was very 
similar to von Neumann's. However, instead of considering a single 
irredundant network as the precursor for the redundant network,
Winograd and Cowan considered a composite network consisting of
k copies of this irredundant network, each computing independently,
to be their precursor. If the original irredundant network has a
complexity σ, then this composite network has complexity kσ, but its
redundancy is still one since each network is capable of performing
independent operations on different inputs. 

To introduce redundancy into this composite network, one con- 
siders sets of k network elements, the members of each set being the 
corresponding elements in each of the k precursors. The redundant 
network is formed by replacing each set of k elements with n modules. 
These modules have the property that each of their inputs is encoded 
according to some (n, k) block code, thus allowing each one to per- 
form an error correction operation on each of its inputs. Every set 
of n modules performs the appropriate operations on the corrected 
inputs so that the n binary outputs from the n modules form the 



* To achieve this nonzero computing capacity, it is necessary to assume that the 
complexity of the encoders and decoders grows only linearly with the block 
length of the code as in the case of convolutional coders and sequential 
decoders. 5-7
code word corresponding to the desired result, namely the code word
whose information digits are equal to the corresponding k outputs in 
the original composite network. 

Winograd and Cowan assumed that modular malfunctions are 
statistically independent from one module to another; furthermore, 
they also assumed that for all modules except the "output modules" 
the probability of a modular malfunction is p , independent of the 
operations performed by that module. The "output modules," those 
modules whose output constitutes the output of the computing system, 
were assumed to be noiseless. To compute a bound on the probability 
of error for this system, each set of n noisy modules can be modeled 
by a set of n noiseless modules, where the output from each module 
is passed through a binary symmetric channel (BSC) with crossover 
probability p . A failure occurs whenever the output from any set of 
n BSC's is such that a noiseless decoder would be unable to decode 
this word correctly. According to the noisy channel coding theorem, 9 
there exist codes for which the probability of such a failure is bounded
by

$$\Pr[\text{modular failure}] \le e^{-nE(R)}$$

where E(R) > 0 for codes with information rates less than the capacity
of a BSC with crossover probability p. Since there are at most
nσ modules within the network and since each module is used only
once during the computation, the probability of a malfunction anywhere
within the network during the computation is bounded by

$$\Pr[\text{failure in network}] \le n\sigma \cdot e^{-nE(R)}$$

which can be made arbitrarily small by making n sufficiently large. 
If the complexity of the modules were fixed, this result would 
imply that the probability of error can be made arbitrarily small by 
making the complexity sufficiently large, while keeping the redundancy 
bounded. Unfortunately, each module must perform encoding and 
decoding which requires a number of operations that grows at least 
linearly with n. Therefore, the complexity of the modules must grow 
at least linearly with n which implies that the redundancy of the 
overall network must grow at least linearly with n rather than being 
bounded as one would have hoped. Therefore, the probability of 
error for the system can be made to approach zero only in the limit 
of infinite redundancy. Hence, Winograd and Cowan's system also 
has a computing capacity equal to zero. 

1.3 Error Criteria

Although the term "probability of error" is used in connection 
with each of the three systems discussed in the previous section, the 
error criterion was different for the different systems. Elias, Wino- 
grad, and Cowan assumed that each input to the computing system 
is uncoded and that the output from the system is also uncoded. 
They assumed that for each set of inputs, the system has one correct 
output defined as the output which would be obtained if the system 
were noiseless. If the actual output differs from the correct output, 
the system has made an error. To obtain a probability of error for the 
system that is not lower bounded by the probability of error for 
the output components, they required that the output components 
be noiseless. 

The necessity for using noiseless output devices arises because of 
the requirement that the output be uncoded. Von Neumann avoided 
this problem by assuming that all inputs and outputs are repeated 
n times. He assumed that the result is "correct" provided that the 
fraction of the outputs which are in error is small. Von Neumann's 
assumption that all inputs and outputs must be repeated is a special 
case of the assumption that all inputs and outputs must be coded 
according to some error correcting code. The latter, more general 
assumption is made in this paper. The only condition that is imposed 
is that both the inputs and the output must be coded according to 
the same code so that computing systems are compatible with each 
other. Since the outputs are coded, there must exist classes of outputs 
corresponding to the decoding equivalence classes of the code. A 
result is considered to be correct provided that it is within the class 
which contains the code word corresponding to the desired result. 

The concept of coded inputs and outputs might, at first, seem 
unrealistic since it implies that the user is capable of performing 
error correcting coding and decoding. However, if we consider the 
case of two people communicating with each other, we observe that 
a very complicated process of coding and decoding takes place. Ap- 
propriate redundancy is introduced not only through the inherent 
redundancy in the language but also through the use of "diversity 
channels" which, in this case, correspond to facial expressions, hand 
and arm movements, voice inflections, and so on. Therefore, since 
there is always an appropriate coding used in the transmission of 
information between individuals, it is unrealistic to expect that a 
machine and user could communicate without the use of some type of 
error correcting procedure. In fact, the condition that all information
must be coded is a necessary requirement for the reliability of a 
computing system in which every component is noisy. 

1.4 Synopsis 

It is our ultimate objective to show that it is possible to construct 
from unreliable components a reliable computing system where a 
reliable system is defined as a type of system which has a nonzero 
computing capacity. For the first part of the analysis, it is assumed 
that component errors are statistically independent from one compo- 
nent to another and from one use of a particular component to another 
use. A computing system constructed from components of this type 
is called a noisy computing system. A virtually identical analysis 
and similar results apply in the case where components within the 
system fail permanently but where periodic maintenance is performed 
on the system; that is, at regularly spaced times, components which 
have failed are replaced by good ones. In this paper we restrict our 
attention to memories. It is shown that information can be stored 
reliably within a "stable memory," a device constructed entirely 
from unreliable components. The paper following this shows that it 
is possible to design a computing system, using unreliable components, 
that performs operations reliably on information stored within stable 
memories. 

II. STABILITY 

The remainder of this paper is concerned with reliable information 
storage in memories constructed from unreliable components. A mem- 
ory is a device in which information is stored at one time and recov- 
ered at some later time. If a memory is to be useful, it must have 
two important characteristics. First, it should be possible both to 
store information in the memory and to read information from the 
memory at any time specified by the user, or at least at any one of 
a set of discrete times where the members of the set are closely 
spaced. All memories considered in this paper can have information 
stored in them at any time and retrieved from them at any time. 
Second, the information read out of the memory must be identical 
to or at least equivalent to the information originally stored. With 
memories constructed entirely from unreliable components, it is un- 
reasonable to expect that the word read out of the memory will be 
identical to the word stored; however, we can hope that the information
will be preserved. To clarify this idea of information preservation
we introduce the concept of stability. 

2.1 The Concept 

To illustrate the meaning of stability, let us consider a simple 
memory consisting of one noisy register and a correcting network. 
The state of this memory is defined by the word contained within 
the register. It is assumed that initially (t = 0) a code word from 
some error correcting code is read into the register, thus defining the 
memory's initial state. Since the register is noisy, errors occur in the 
stored digits; hence, the state of the memory is perturbed away from 
the original one. The correcting network monitors the contents of 
the register, performs error correction, and inserts into the register 
an estimate of the original code word. If there were no correcting 
network, the state of the memory would "wander away" from the 
original one; however, the correcting network provides a "restoring 
force" which tends to bring the state of the device back to the original. 
The noise may perturb the state of the device beyond the error cor- 
recting capability of the correcting network. If this happens, we say 
that a memory failure has occurred. To define a memory failure more
precisely, it is necessary to associate with the different input code 
words disjoint classes of states. As long as the state of the memory 
remains within the appropriate class, no memory failure has occurred. 

The redundancy of a memory is defined to be the ratio of the complexity
of the memory to the complexity of an irredundant memory which
has the same information storage capability. The inputs to the memory
being considered are code words from an (n, k) block code; hence, the
memory has a storage capability of k bits. This memory is denoted by
M_k. An irredundant memory with the same information storage capability
as M_k would have a complexity equal to k, since it would consist
of k one-bit information storage components. If we consider k to be a
variable, we can speak of a sequence of memories where M_k is a typical
member of the sequence. If the complexity of every memory in this
sequence is less than α·k, the redundancy of any memory in the sequence
must be less than α which, by assumption, is independent of k. Hopefully,
for any T, it is possible to make the probability of a memory failure
during the time interval 0 ≤ t ≤ T arbitrarily small by choosing k
sufficiently large while keeping α bounded. If the sequence of memories
has this property, we say that the sequence is stable. For convenience
we refer to a typical member of a stable sequence of memories as a
stable memory. Here is a more concise definition of stability.

2.2 The Definition

Consider a sequence of memories denoted by {M_i}. The memories
in this sequence are ordered according to their information storage
capability; that is, the kth memory, M_k, has a storage capability of
k bits. The sequence {M_i} is called stable if it satisfies the following
conditions:

(i) For any k, M_k must have 2^k allowed inputs which are denoted
by {I_ki}, 0 < i ≤ 2^k.

(ii) With each input there must be associated a class of states of
M_k. The classes associated with different inputs must be disjoint.
For any k and i, the class of states corresponding to I_ki is denoted by
C(I_ki).

(iii) The complexity of M_k must be bounded by α·k where α is a
fixed parameter for any particular sequence.

(iv) Suppose that at t = 0 any one of the allowed inputs is transmitted
to each memory in the sequence. Let there be no inputs for all
t > 0. Consider a typical memory M_k. Denote the particular input
that was transmitted to M_k by I_ki. The probability that the state of
M_k does not belong to C(I_ki) at t = T is denoted by p_ki(T), and
max_i [p_ki(T)] is denoted by p_k(T). If the sequence is stable then, for any
T > 0 and δ > 0, there must be a k such that p_k(T) < δ.

2.3 Examples of Memories 

To further clarify the meaning of stability, consider two types 
of memories. The first consists of a noisy register without any cor- 
recting network and the second consists of a noisy register with a 
noiseless correcting network. In light of the discussion in Section 2.1, 
we do not expect that memories of the first type can be stable whereas 
we do expect that memories of the second type can be. These expecta- 
tions are correct. 

Consider first a sequence of noisy registers. A typical member of this
sequence is a register containing n binary digits which define the
register's state. Let p denote the probability that any particular digit
stored in the register is changed during a time interval τ. If one is
interested in the contents of the register only at the times t = 0, τ, 2τ,
3τ, ..., a model for the noisy register is the noiseless register together
with the n binary symmetric channels shown in Fig. 2. The allowed
inputs to this noisy register, M_k, are the 2^k code words from some (n, k)
code. The class of states corresponding to a particular input is the decoding
equivalence class corresponding to that code word. The complexity
of M_k is n = k/R, where R is the information rate of the code; therefore,
the appropriate value of α for this sequence is 1/R.

Fig. 2 - Model for a noisy register. The digits in the noiseless register are
transmitted through the BSC's once every τ seconds.

Suppose that at t = 0 one of the allowed inputs is transmitted to M_k.
At t = τ the probability of error per digit is p. The probability of a
memory failure at t = τ equals the probability that the word in the
register at t = τ does not belong to the decoding equivalence class containing
the original input. Provided that the code is at least as "good"
as an average random code, the noisy channel coding theorem 9 states
that p_k(τ) is bounded by

$$p_k(\tau) \le \exp\left[-(k/R)\,E(R)\right]$$

where E(R) is positive for R < 1 - H(p), and 1 - H(p) is the capacity
of a BSC with crossover probability p. Since the probability of failure
is independent of the particular input, it is unnecessary to perform
the maximization over all inputs.

Next consider the state of the register at time t = T = Lτ. According
to the model in Fig. 2, to determine the state of the noisy register
at t = Lτ we must transmit the n binary digits through their respective
BSC's L times. This is equivalent to transmitting each binary digit 
through L BSC's in series. Since the overall capacity of these chan- 
nels in series decreases asymptotically to zero as L increases, for any 
fixed rate there must exist some L for which the capacity of the L 
channels in series is less than R. Therefore, for any sequence of noisy 
registers with bounded redundancy (fixed a or R) , there must exist 
some L (or T) such that there is no register in the sequence which 
has a sufficiently small probability of failure to satisfy the require- 
ments for stability. Thus, as would be expected, noisy registers are 
not stable. 
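
The decay of capacity with the number of uses can be made concrete. A cascade of L binary symmetric channels, each with crossover probability p, is itself a BSC with crossover probability (1 - (1 - 2p)^L)/2, so its capacity is 1 - H of that value. The sketch below, whose function names are chosen only for this illustration, finds the first L at which the cascade capacity drops below a given code rate R.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def cascade_crossover(p, uses):
    """Crossover probability of `uses` identical BSCs connected in series."""
    return (1 - (1 - 2 * p) ** uses) / 2

def cascade_capacity(p, uses):
    """Capacity, in bits per digit, of the series connection."""
    return 1 - binary_entropy(cascade_crossover(p, uses))

def uses_until_capacity_below(p, rate):
    """Smallest L for which the capacity of L channels in series is below `rate`."""
    if not 0 < p < 0.5 or not 0 < rate <= 1:
        raise ValueError("need 0 < p < 0.5 and 0 < rate <= 1")
    uses = 1
    while cascade_capacity(p, uses) >= rate:
        uses += 1
    return uses

for L in (1, 5, 10, 50, 100):
    print(f"L = {L:3d}  capacity = {cascade_capacity(0.01, L):.4f}")
print("first L with capacity below R = 0.5:", uses_until_capacity_below(0.01, 0.5))
```

Beyond that L, no code of rate R can keep the failure probability small, whatever its block length, which is why an uncorrected register cannot satisfy condition (iv).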

As a second example, consider the same sequence of noisy registers
but this time associate with each register a noiseless correcting network.
Within each time interval of length τ, this correcting network
takes the output from the n BSC's and maps this vector onto a code
word which is then inserted into the register to define its new state.
This type of device is shown in Fig. 3. The operation performed by
the correcting network is equivalent to that performed by a noiseless
decoder followed by a noiseless encoder. This operation can be performed
by a correcting network whose complexity is proportional
to k; for example, a noiseless sequential decoder followed by a noiseless
convolutional encoder. 5-7 If such a correcting network is used,
the redundancy is independent of k and therefore can be bounded
for all k. The probability of a memory failure at t = τ is again upper
bounded by exp[-(k/R) E(R)], but the probability of a memory
failure at t = Lτ is now bounded by

$$p_k(L\tau) \le L \cdot \exp\left[-(k/R)\,E(R)\right]$$

according to the union bound. Therefore, for any finite L, p_k(Lτ)
approaches zero as k approaches infinity provided that R < 1 - H(p).
This shows that a noisy register with a noiseless correcting network
can be stable.
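
The contrast between the two examples can be seen in a small Monte Carlo experiment. The sketch below is a toy under loud assumptions: it uses an (n, 1) repetition code with majority-vote correction in place of the good block codes of the text, so it illustrates the "restoring force" of a noiseless correcting network but not bounded redundancy. All names and parameter values are hypothetical.

```python
import random

def flip_bits(word, p, rng):
    """Pass each stored digit through a BSC with crossover probability p."""
    return [bit ^ (rng.random() < p) for bit in word]

def majority(word):
    """Noiseless decoder for the repetition code (n is assumed odd)."""
    return 1 if 2 * sum(word) > len(word) else 0

def failure_rate(n, p, steps, trials, correct, seed=0):
    """Fraction of trials in which a stored 0 decodes as 1 after `steps` intervals."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        word = [0] * n  # code word for the stored information digit 0
        for _ in range(steps):
            word = flip_bits(word, p, rng)
            if correct:
                # Noiseless correcting network: decode, then re-encode.
                word = [majority(word)] * n
        if majority(word) != 0:
            failures += 1
    return failures / trials

without_network = failure_rate(n=15, p=0.05, steps=50, trials=300, correct=False)
with_network = failure_rate(n=15, p=0.05, steps=50, trials=300, correct=True)
print("failure rate, noisy register alone:      ", without_network)
print("failure rate, with noiseless correction: ", with_network)
```

Without correction the digit errors accumulate from one interval to the next; with correction each interval starts again from a code word, so the overall failure probability is at most L times the single-interval value.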

III. THE STABILITY OF MEMORIES CONSTRUCTED ENTIRELY FROM 
UNRELIABLE COMPONENTS 

We now show that a noisy register with a noisy correcting network 
can be stable. This proof of stability is extended to memories in which 
components fail permanently but where the components which have 
failed are periodically replaced. 

Fig. 3 - Noisy register with noiseless correcting network.

3.1 The Importance of Low-Density Parity-Check Codes

The memories to be considered store information in the form of 
code words from an error correcting code. Each memory consists of 
one or more noisy registers and a noisy correcting network that 
performs operations which are very similar to those performed by a 
decoder. It is our objective to show that noisy memories of this type 
can be stable. Since the complexity of a stable memory is required 
to be bounded by a-k, where k is the information storage capability 
and a is a proportionality factor which does not depend on k, the 
complexity of the correcting network in any stable memory can 
grow only linearly with k. There are only two kinds of correcting 
networks (decoders) known to have this property. One is a correcting 
network based on a sequential decoder 6,7 and the other is a correcting
network based on a low-density parity-check decoder. 10 

In deciding whether a particular correcting (decoding) procedure 
is suitable for use in a noisy correcting network, one must consider 
whether there are any essential steps in the procedure which could 
not be performed with a small probability of error by a noisy device. 
For example, almost all parity-check decoders are required to com- 
pute the modulo-2 sum of a set of digits where the number of digits 
in the set is proportional to the constraint length, N, of the code. 
To compute the probability that a noisy device makes an error
in performing this operation, consider the noisy addition network
shown in Fig. 4. This network, consisting of N - 1 adders (modulo-2),
computes the modulo-2 sum of N binary inputs. Let us assume that
each adder in the network has a probability of error p_a and that adder
errors are statistically independent from one adder to another and
from one use of a particular adder to another. The output of the noisy
addition network will be in error if an odd number of adders make
errors. It can be shown 10 that the probability of this event equals

$$\text{Prob of error} = \frac{1 - (1 - 2p_a)^{N-1}}{2}$$

This probability approaches 1/2, exponentially with N, as N approaches
infinity.
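
The closed form (1 - (1 - 2 p_a)^(N-1))/2 can be checked against a direct recursion: if q is the probability that the partial sum is wrong after some number of adders, one more independent adder makes it q(1 - p_a) + (1 - q) p_a. The sketch below compares the two; the function names and the sample value p_a = 0.01 are assumptions of this illustration.

```python
def chain_error_closed_form(p_a, n_inputs):
    """Probability that an odd number of the N - 1 adders err."""
    return (1 - (1 - 2 * p_a) ** (n_inputs - 1)) / 2

def chain_error_recursive(p_a, n_inputs):
    """Same probability, built up one adder at a time."""
    q = 0.0  # with no adders the (trivial) sum is certainly correct
    for _ in range(n_inputs - 1):
        # The output is wrong if exactly one of (previous sum, this adder) is wrong.
        q = q * (1 - p_a) + (1 - q) * p_a
    return q

for n_inputs in (2, 10, 100, 1000):
    closed = chain_error_closed_form(0.01, n_inputs)
    stepwise = chain_error_recursive(0.01, n_inputs)
    print(f"N = {n_inputs:4d}  closed form = {closed:.6f}  recursion = {stepwise:.6f}")
```

Even with adders that err only once in a hundred uses, a sum over a thousand inputs is nearly as likely to be wrong as right.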

Fig. 4 - Modulo-2 addition network. ⊕ represents one adder (modulo-2).

The memories under consideration store coded information. To
make the probability of a memory failure arbitrarily small (as required
for stability), one would expect that it would be necessary to
make the constraint length of the code arbitrarily large. Therefore,
the noisy correcting network must be able to perform error correction
even if the constraint length of the code is very long. This implies
that no correcting procedure involving a modulo-2 addition operation 
of the type just described is suitable for use in a noisy correcting 
network. For example, the correcting network based on a sequential 
decoder must generate hypothesized branches of a code tree. Each 
digit on one of these branches is computed by forming the modulo-2 
sum of a set of digits where the number of digits in the set is propor- 
tional to the constraint length of the code. The probability that a 
noiseless sequential decoder makes an error can be made arbitrarily 
small only if this constraint length is made arbitrarily large. But 
making the constraint length arbitrarily large makes the probability 
of an addition error within the noisy decoder arbitrarily close to 1/2.
Therefore, a noisy correcting network should not be based on a se- 
quential decoder. 

Fortunately the correcting network based on a low-density parity- 
check decoder does not have this problem. A low-density parity-check 
decoder does evaluate parity checks, modulo-2 sums of the digits in 
parity-check sets; however, the number of digits in each parity-check
set is not a function of the block length of the code. 

3.2 The Correcting Algorithm 

The memories to be considered consist of several registers and a 
correcting network and they store information in the form of code 
words from a low-density parity-check code. It is our objective to 
show that memories of this type can be stable. The first step is to 
consider the details of the correcting algorithm. This requires a brief 
explanation of low-density parity-check codes. 

An (N, J, K) low-density parity-check code is defined to be a code
with block length N such that there are K digits in each parity-check
set and J parity-check sets containing any particular digit. Such a
code can be represented by a parity-check matrix which has K ones
in each row and J ones in each column. For example, an (N = 20,
J = 4, K = 5)* low-density parity-check matrix is shown in Fig. 5.

Fig. 5. The parity-check matrix for a (20, 4, 5) low-density parity-check code.
The digit positions, denoted by d with appropriate subscripts, are numbered in
the way described in Fig. 6.
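
The row and column constraints in this definition can be checked mechanically. The sketch below builds an (N, J, K) matrix by stacking J blocks, each a random column permutation of a simple base block, in the spirit of Gallager's random construction. This is an assumption made for illustration: it will not reproduce the particular matrix of Fig. 5, and the function names and seed are hypothetical.

```python
import random


def gallager_parity_check_matrix(N, J, K, seed=0):
    """Build an (N, J, K) matrix from J stacked column permutations of a base block."""
    if N % K != 0:
        raise ValueError("N must be a multiple of K")
    rows_per_block = N // K
    # Base block: row r checks digits r*K .. r*K + K - 1, so every column has one 1.
    base = [[1 if c // K == r else 0 for c in range(N)] for r in range(rows_per_block)]
    rng = random.Random(seed)
    matrix = [row[:] for row in base]
    for _ in range(J - 1):
        perm = list(range(N))
        rng.shuffle(perm)
        # Permuting columns keeps K ones per row and one 1 per column in the block.
        matrix.extend([[row[perm[c]] for c in range(N)] for row in base])
    return matrix


def is_regular(matrix, J, K):
    """True if every row has K ones and every column has J ones."""
    rows_ok = all(sum(row) == K for row in matrix)
    cols_ok = all(sum(col) == J for col in zip(*matrix))
    return rows_ok and cols_ok


if __name__ == "__main__":
    H = gallager_parity_check_matrix(20, 4, 5)
    print(len(H), len(H[0]), is_regular(H, 4, 5))
```

A random construction of this kind does not guarantee that two parity-check sets share at most one digit, so it is suitable for checking the weight constraints but not as a stand-in for a carefully chosen code.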

Gallager has described two schemes for decoding low-density parity-check
codes, both of which are iterative.¹⁰ However, we shall only be
concerned with the simpler one. Each iteration of this scheme consists 
of first computing all the parity checks and then changing the value 
of any digit that is contained in more than a certain fixed number of 
unsatisfied parity-check constraints (if a parity check equals one, 
the corresponding parity-check constraint is unsatisfied). Provided 
that there were not too many errors initially, each successive itera- 
tion decreases the number of digits in error and, eventually, all parity 
check constraints are satisfied indicating that the resulting word is 
a code word. 

* Notice that K is always greater than J.

To illustrate how this method works, let us suppose that the digit
$d_0$ is in error but that all other digits are correct. In this case, all
parity-check constraints involving $d_0$ will be violated, whereas at
most one parity-check constraint involving any other digit will be
violated. Therefore, $d_0$ will be changed whereas all other digits will
be unchanged. In this way the digit $d_0$ will be corrected. If there are
errors among the digits used to check $d_0$, this digit may not be corrected
during the first iteration. However, after one or more iterations
sufficiently many of these errors may have been corrected to allow
$d_0$ to be corrected also.
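
The single-error case just described can be traced with a few lines of code. The sketch below implements one iteration of the simple scheme on a toy code chosen only for illustration (an assumption, not the code of Fig. 5): nine digits on a 3 x 3 grid with one parity-check set per row and per column, so that J = 2, K = 3, and any two check sets share at most one digit. The function names and the threshold value are likewise illustrative.

```python
def parity_checks(word, check_sets):
    """Return one bit per check set; 1 means the constraint is unsatisfied."""
    return [sum(word[d] for d in s) % 2 for s in check_sets]


def correcting_iteration(word, check_sets, threshold):
    """Flip every digit contained in more than `threshold` unsatisfied check sets."""
    checks = parity_checks(word, check_sets)
    unsatisfied = [0] * len(word)
    for value, s in zip(checks, check_sets):
        if value:
            for d in s:
                unsatisfied[d] += 1
    return [bit ^ 1 if unsatisfied[d] > threshold else bit
            for d, bit in enumerate(word)]


# Toy code (hypothetical): rows and columns of a 3 x 3 grid of digits.
ROWS = [[3 * r + c for c in range(3)] for r in range(3)]
COLS = [[3 * r + c for r in range(3)] for c in range(3)]
CHECK_SETS = ROWS + COLS

if __name__ == "__main__":
    received = [0] * 9   # the all-zero word satisfies every parity check
    received[4] = 1      # a single error in the centre digit
    print(correcting_iteration(received, CHECK_SETS, threshold=1))
```

Both check sets containing the erroneous digit are violated, while every other digit sees at most one violated check, so only the erroneous digit is flipped.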

This simple decoding algorithm does have one problem. Recall that
for any digit $d_0$, the original values of all the other digits in the J
parity-check sets containing $d_0$ are involved in the determination of
the new value of $d_0$ and similarly that the original value of $d_0$ is used
in computing the new value of each digit in these J parity-check sets.
This means that on successive iterations the values of the digits
involved in making a new estimate of $d_0$ depend on the previous
estimate of $d_0$. This leads to a complex interrelation between the
errors which degrades the decoder's performance and, needless to say,
greatly complicates its analysis. Fortunately Gallager has suggested
a way to modify the algorithm which at least partially solves this
problem. Using this modified algorithm, J estimates of each digit are
made, each one being computed using a different combination of $J - 1$
out of the J parity-check sets containing the digit to be estimated.
The value of a particular estimate is changed if J/2 or more out of
the $J - 1$ parity checks are equal to one.

To construct a correcting network based on this algorithm, one
starts with J registers of length N. Although the assignment of digit
estimates to register locations can be arbitrary, one would probably
choose to assign one estimate of each of the N digits in a code word
to the corresponding N locations in each register. Since each estimate
of a digit, say $d_0$, is to be made on the basis of $J - 1$ parity-check
sets, each containing $K - 1$ digits other than $d_0$, there must be
$(J - 1)(K - 1)$ other digits interconnected with the input to $d_0$. Using the
modified algorithm, it is necessary to specify not only the digits to
be interconnected but also the appropriate estimate of each digit.
For example, consider the parity-check set $(d_0, d_{11}, d_{12}, d_{13}, d_{14})$. The
appropriate estimate of the digit $d_{11}$ to be used in correcting $d_0$ is that
estimate based on the $J - 1$ parity-check sets which omit the one
containing $d_0$. This estimate is used so that the values of the digits
involved in computing the new estimate of $d_0$ do not, themselves,
depend on a previous estimate of $d_0$.
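
One correcting cycle of the modified algorithm can be sketched as follows. The code keeps J estimates of every digit, indexed by the parity-check set that each estimate omits, and flips an estimate when J/2 or more of its J - 1 parity checks equal one. The toy code used here is an assumption made for illustration: the 25 points of the affine plane over the integers modulo 5, with the lines in four directions as parity-check sets, giving N = 25, J = 4, K = 5. All names are hypothetical.

```python
Q = 5  # size of the toy field (assumption); gives N = 25, J = 4, K = 5


def build_check_sets():
    """Lines of the affine plane over Z_5 in four directions, as lists of digit indices."""
    sets = []
    for slope in range(3):                      # lines y = slope * x + b
        for b in range(Q):
            sets.append([Q * x + (slope * x + b) % Q for x in range(Q)])
    for a in range(Q):                          # vertical lines x = a
        sets.append([Q * a + y for y in range(Q)])
    return sets


def load_word(word, check_sets):
    """Store J copies of a word: one estimate per (digit, omitted check set) pair."""
    estimates = [dict() for _ in word]
    for c, digits in enumerate(check_sets):
        for d in digits:
            estimates[d][c] = word[d]
    return estimates


def correcting_cycle(estimates, check_sets, J):
    """One cycle; estimates[d][c] is the estimate of digit d that omits check set c."""
    member_of = [[] for _ in estimates]
    for c, digits in enumerate(check_sets):
        for d in digits:
            member_of[d].append(c)
    new = [dict(e) for e in estimates]
    for d in range(len(estimates)):
        for omitted in member_of[d]:
            ones = 0
            for c in member_of[d]:
                if c == omitted:
                    continue
                # Every other digit in c contributes its estimate that omits c.
                parity = estimates[d][omitted]
                for other in check_sets[c]:
                    if other != d:
                        parity ^= estimates[other][c]
                ones += parity
            if 2 * ones >= J:                   # J/2 or more of the J - 1 checks equal one
                new[d][omitted] = estimates[d][omitted] ^ 1
    return new


if __name__ == "__main__":
    sets = build_check_sets()
    word = [0] * (Q * Q)
    word[7] = 1                                 # single digit error in every register
    after = correcting_cycle(load_word(word, sets), sets, J=4)
    print(all(v == 0 for e in after for v in e.values()))
```

Because the estimate of a neighbouring digit used in a given parity check is the one that omits that check, the new estimate of a digit never depends on a value derived from its own previous estimate through the same check set.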

This correcting network performs many iterations. During the
first iteration each digit is estimated on the basis of the $(J - 1)(K - 1)$
digits to which it is interconnected. Since, during the second
iteration, the same operations are performed, the resulting second
estimate of each digit depends on the first estimates of these
$(J - 1)(K - 1)$ interconnected digits which, in turn, depend on the initial
estimates of a much larger set of digits. The sets of digits involved
in making successive estimates of some digit, say $d_0$, can be represented
by means of parity-check set trees of the type shown in Fig. 6.
The branches rising from $d_0$ represent one set of $J - 1$ parity checks
containing this digit. The interconnected nodes on the first tier of this
tree represent the digits, other than $d_0$, in one of these parity-check
sets. Each digit on the first tier of the tree is also contained in $J - 1$
other parity-check sets. These other parity-check sets are represented
by the branches rising from the first tier to the second tier of the tree.
This parity-check set tree can be extended indefinitely. The structure
of the tree, beyond the first tier, is completely specified by the
parity-check sets of the code which, in turn, are specified by the parity-check
matrix.

To see how the tree represents the digits involved in estimating
$d_0$, let us suppose that the decoder has performed i iterations. During
each iteration the digits on each tier of the tree are used to estimate
the digits on the tier immediately below. Hence, after i iterations, the
value of each digit, particularly $d_0$, has been influenced by the values
of the digits on the first i tiers above it.

If a parity-check set tree is extended for many tiers, eventually
some digit will appear in two different places in the tree. If the first
repetition within any of the $J \cdot N$ trees occurs on the $(m + 1)$th tier,
then, according to Gallager's nomenclature, the code has m independent
iterations. Notice that this number, m, is a parameter of the code
which is unrelated either to the statistics of the noise or to the particular
decoding algorithm. This number plays a particularly important
part in the analysis of the correcting procedure as it is shown
that errors in specific sets of digits are statistically independent
during these first m iterations.

Fig. 6. Two representations for the parity-check constraints involving $d_0$. At
the top is one of the J parity-check set trees rising from $d_0$. Beneath it are the
parity-check equations containing $d_0$.

The physical configuration of the memory is shown in Fig. 7.
Within the correcting network there are $J \cdot N$ identical sets of components.
Each set of components performs the operations required to
estimate one particular digit. Let us consider a set of components
which computes estimates of the digit $d_0$. The first operation that
must be performed by this set of components, the computation of the
$J - 1$ parity checks rising from $d_0$, requires $(J - 1)(K - 1)$ two-input
binary modulo-2 adders. The second operation, deciding whether the
digit $d_0$ should be changed, can be performed by a decision device
(threshold device) the output of which is a 1 if $d_0$ is to be changed
and a 0 if $d_0$ is not to be changed. Finally, the output of this decision
device must be added modulo-2 to the previous estimate of $d_0$, the
operation requiring one binary adder (modulo-2). Similar operations
are performed to obtain estimates of all $J \cdot N$ digits. These operations
constitute one correcting cycle (iteration). The correcting network
performs correcting cycles continuously once information has been
read into the registers.

Fig. 7. Physical configuration of a memory based on a low-density parity-check
decoder.

Finally, let us consider the situations which cause the estimate of
the digit $d_0$ to be in error after a particular iteration. Since any digit
$d_0$ which is in error will be corrected only if a sufficient number of the
parity checks containing $d_0$ are equal to 1, ideally, we would like
every parity check containing $d_0$ to equal 1 if $d_0$ is in error but to
equal 0 if $d_0$ is correct. If a parity check does not equal the desired
value we say that the parity check is in error. Thus a parity check
error depends on errors made by the adders and errors in the digits
involved in estimating $d_0$, but not on the value of $d_0$ itself.

To simplify the mathematical analysis, we restrict our attention
to the class of low-density parity-check codes with $J = 2l$, $l = 2, 3, 4, \ldots$,
and the correcting algorithm stated previously. A set of events
each one of which alone leads to an error (indicated by e) in the estimate
of the digit $d_0$ after a particular iteration are:

(i) $d_0 = e$ after previous iteration and J/2 or more parity checks are
in error.

(ii) $d_0 \neq e$ after previous iteration and J/2 or more parity checks are
in error.

(iii) The decision (threshold) device makes an error.

The first two conditions demonstrate an interesting property of this
class of low-density parity-check codes. For this class of codes the
conditions leading to an error in any digit after any iteration are the
same whether or not the digit was in error before the iteration. This
property will help to simplify the following mathematical analysis.

3.3 The Stability Theorem for Noisy Memories 

Theorem 1: There is a stable sequence of noisy memories where every 
component in every memory has a fixed, nonzero probability of error 
per use. 

Proof: The memories under consideration consist of J registers of 
length N, a noisy correcting network based on a low-density parity- 
check decoder, and a set of communication channels over which the 
inputs are transmitted. The definition of stability given in Section 2.2 
includes specific conditions that must be satisfied by stable memories. 
To prove that the memories under consideration are stable, we must 
show that they satisfy all these conditions. 

The definition of stability requires that there be a set of allowed
inputs for each memory. Each allowed input for a memory of the type
under consideration consists of J copies of a code word from some
(N, J, K) low-density parity-check code. If R equals the information
rate of the code, the memory having registers of length N has $2^{RN} = 2^k$
allowed inputs. This memory is denoted by $M_k$. To define a sequence of
memories, we allow k (and N) to be a variable but keep the parameters
of the code, J and K, fixed.* Let the $2^k$ allowed inputs to $M_k$ be denoted
by $\{I_{ki}\}$, $0 < i \le 2^k$. With each input we must associate a class of states
of $M_k$. The classes to be used are essentially the decoding equivalence
classes of the code. To be more precise, the classes can be described in
terms of the equilibrium states of $M_k$, denoted by $\{E_{ki}\}$, $0 < i \le 2^k$,
where an equilibrium state is one in which every register contains one
and the same code word. With each input, $I_{ki}$, there is a corresponding
equilibrium state, $E_{ki}$, in which the registers contain the code words
represented by $I_{ki}$. The state of $M_k$ belongs to the class $C(I_{ki})$ if a
noiseless correcting network (that is, a noiseless low-density parity-check
decoder) could correct all the errors; in other words, if its final state
would be the equilibrium state, $E_{ki}$.

The definition of stability requires that for each sequence of memories
there must be an $\alpha$ such that for every k the complexity of $M_k$ is
bounded by $\alpha k$. The complexity of the correcting network in $M_k$ is
computed in Section 3.2. If each of the J registers in $M_k$ is of length
N, the correcting network must contain $[1 + (J - 1)(K - 1)] \cdot J \cdot N$ binary
adders (modulo-2) and $J \cdot N$ decision devices. Since a decision
device must determine whether J/2 or more out of the $J - 1$ parity
checks are in error, the complexity of each decision device depends
only on the code parameter, J, which is fixed for any particular sequence
of memories. Thus the complexity of each decision device,
denoted by D, is independent of k. Finally, the noisy registers within
the memory must contain $J \cdot N$ storage components. Therefore the
complexity of $M_k$ is:

$$\text{Complexity of } M_k = [2 + D + (J - 1)(K - 1)] \cdot J \cdot N = \frac{[2 + D + (J - 1)(K - 1)] \, J}{R} \cdot k.$$

Since Gallager¹⁰ has shown that $R \ge 1 - J/K$, the complexity proportionality
factor for the sequence of memories under consideration is
upper bounded by

$$\alpha \le \frac{[2 + D + (J - 1)(K - 1)] \, J}{1 - J/K}.$$

* There are low-density parity-check codes for all values of N which are
integer multiples of K.¹⁰
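
The component count can be tallied directly from the description of the correcting network. The sketch below adds up adders, decision devices, and storage components and evaluates the bound on the proportionality factor; the value used for the decision-device complexity D is a placeholder, since the text fixes only that it is independent of k, and the function names are illustrative.

```python
def memory_complexity(N, J, K, D):
    """Total component count for registers of length N."""
    adders = (1 + (J - 1) * (K - 1)) * J * N   # modulo-2 adders
    decision = D * J * N                       # J*N decision devices, each of complexity D
    storage = J * N                            # J*N storage components
    return adders + decision + storage


def complexity_factor_bound(J, K, D):
    """Upper bound on complexity per information bit, using R >= 1 - J/K."""
    return (2 + D + (J - 1) * (K - 1)) * J / (1 - J / K)


if __name__ == "__main__":
    D = 1  # placeholder decision-device complexity (assumption)
    print(memory_complexity(20, 4, 5, D))
    print(complexity_factor_bound(4, 5, D))
```

The bound depends only on J, K, and D, so it stays fixed as k and N grow along a sequence of memories.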

Every component has a nonzero probability of making an error. 
The error probabilities are denoted: 

$p_r$ = probability that one particular binary digit within the register
changes during a time interval $\tau$, the time required for one
correcting cycle.

$p_a$ = probability that an adder makes an error on any one use.

$p_d$ = probability that a decision device makes an error on any one use.

$p_c$ = probability of an error in transmitting a digit across a communication
channel.

It is assumed that errors are statistically independent from one component
to another and from one use of a particular component to another
use.

To prove that the information storage devices in question are stable, 
it must be shown that for any $\mathcal{L}$, the probability of a memory failure
in $M_k$ during the time interval $0 \le t \le \mathcal{L}\tau$ can be made arbitrarily small
by choosing k sufficiently large. In the remainder of this section, the
probability of a memory failure for the typical device, $M_k$, is upper
bounded. To determine whether a memory failure has occurred at some
particular time we use the following algorithm. 

Imagine that $M_k$ becomes noiseless at that time and that the noiseless
correcting network within $M_k$ performs m more iterations where
m is the "number of independent iterations" described in Section 3.2.
These are precisely the same operations that a noiseless low-density 
parity-check decoder would perform. If the final state of the hypothet- 
ical noiseless memory is error free (that is, equals the equilibrium 
state corresponding to the original input) then, by definition, no 
memory failure occurred. Thus, to bound the probability of a memory 
failure we must bound the probability that the final error pattern is 
not error free. 

In general, the error pattern within the registers of $M_k$ depends on
all the component errors that have occurred since the original input 
was transmitted to the memory. However, if certain conditions are 
satisfied, the error pattern is a function of only the component errors 
that occurred during the most recent m iterations. If no component 
errors occur during these m iterations and if the required conditions 
are satisfied, the final error pattern will be error free. On the other 
hand, if these conditions are not satisfied we say that a propagation 
failure has occurred. Therefore, to bound (upper bound) the probability
of a memory failure, we bound the probability that a propagation
failure occurs on the mth noiseless iteration of the hypothetical
noiseless memory.

The first step is to show that, in the absence of a propagation failure, 
the probability of error for digits stored in the memory (the probability 
of error per digit) has an upper bound, denoted by $p_0$, such that $p_0 \ll 1/2$.
This can be seen intuitively by comparing the performance of a noisy 
corrector with that of a noiseless corrector, that is, a noiseless low- 
density parity-check decoder. Provided that the initial probability of 
error is not too large, a noiseless corrector decreases the probability of 
error per digit with each iteration until this probability reaches zero. 
If the probability of error for the components within the noisy corrector 
is small, compared with the initial probability of error per digit, for the 
first few iterations the noisy corrector decreases the probability of error 
per digit just as the noiseless one did. However, eventually this proba- 
bility reaches the same order of magnitude as the probability that the 
noisy corrector itself makes an error at which time the probability of 
error per digit reaches an equilibrium value. Notice that although such 
an equilibrium value is attained, it is still possible for errors to occur in 
sufficient digits so that a memory failure results. If a memory failure 
does occur, a propagation failure will also occur and the bound on the 
probability of error per digit will no longer be valid. 

In light of this intuitive argument, we expect that the time at which 
the probability of error per digit is at its maximum value is just 
before the end of the first correcting cycle. In Appendix A upper 
bounds are computed on this probability evaluated just before the 
end of the first and successive correcting cycles. It is shown that, 
provided no propagation failure occurs, these bounds form a mono- 
tonically decreasing sequence; hence, the bound on this probability 
evaluated just before the end of the first correcting cycle, denoted 
by $p_0$, is the desired bound.

The next step is to bound the probability that the initial propaga- 
tion failure occurs at some particular time. A propagation failure 
occurs whenever the error pattern in the memory is related to the 
component errors that occurred m or more iterations previously. In
most cases the error pattern depends only on component errors that 
occurred in the last few iterations since any digit errors caused by 
previous component errors would have already been corrected. In 
order for a propagation failure to occur, the effect of component er- 
rors must have propagated from one iteration to the next for at least 
m iterations. Thus we expect that the probability of such a propagation
decreases as m is increased and, since increasing k increases m,
that the probability of a propagation failure can be made arbitrarily 
small by making k sufficiently large. The explicit relationship be- 
tween the probability of the initial propagation failure and k is 
derived in Appendix B. The result is 

$$\Pr[\text{initial propagation failure}] < C k^{-\beta + 1}$$

where C and $\beta$ depend only on the parameters of the code (J and K)
and the bound on the probability of error per digit, $p_0$; hence, C and
$\beta$ are constants for any particular sequence of memories.

Some typical values for $\beta$ are shown in Table I. For example, if
J = 14, K = 15, and $p_0 = 10^{-8}$ then $\beta = 7.55$; therefore, in this case,
the probability that the first propagation failure occurs at some 
particular time is upper bounded by a function that decreases as the 
sixth power of the information storage capability of the memory. By 
choosing the information storage capability sufficiently large, this 
probability can be made arbitrarily small. 

For the moment, let us assume that neither a memory failure nor a
propagation failure has occurred within $M_k$ during the time interval
$0 \le t < \mathcal{L}'\tau$. We now use the bound on the probability of the initial
propagation failure to find an upper bound on the probability that either
the initial memory failure or the initial propagation failure occurs at
$t = \mathcal{L}'\tau$. To determine whether the initial memory failure occurs at
$t = \mathcal{L}'\tau$ we must imagine that $M_k$ becomes noiseless at $t = \mathcal{L}'\tau$ and that
the noiseless correcting network within $M_k$ performs m more iterations.
As explained previously, the initial memory failure can occur at $t = \mathcal{L}'\tau$
only if the initial propagation failure occurs at $t = \mathcal{L}'\tau$ or during the
m noiseless iterations performed after $t = \mathcal{L}'\tau$. Thus the sum of the
probabilities that the initial propagation failure occurs at $t = \mathcal{L}'\tau$,
$(\mathcal{L}' + 1)\tau, \ldots, (\mathcal{L}' + m)\tau$ is an upper bound on the desired probability.
Therefore,

$$\Pr[\text{initial memory failure or initial propagation failure at } t = \mathcal{L}'\tau] < (m + 1) \cdot C \cdot k^{-\beta + 1}.$$

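Gallager¹⁰ has shown that

$$m < \frac{\log N}{\log[(J - 1)(K - 1)]}.$$

Therefore,

$$\begin{aligned}
m + 1 &< \frac{\log N}{\log[(J - 1)(K - 1)]} + 1 \\
      &\le \frac{\log\left[\dfrac{k}{1 - J/K}\right]}{\log[(J - 1)(K - 1)]} + 1 \\
      &< k\left[\frac{1}{1 - J/K}\right] \quad \text{if } k > 2.
\end{aligned}$$

(Note: $K > J \ge 4$).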
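
The chain of inequalities above can be spot-checked numerically. The sketch below evaluates Gallager's bound on m and the looser bound $k/(1 - J/K)$ on $m + 1$, taking the largest block length permitted by $R \ge 1 - J/K$, that is, $N = k/(1 - J/K)$. The function names and the sample values of k are illustrative.

```python
import math


def independent_iterations_bound(N, J, K):
    """Upper bound log N / log[(J-1)(K-1)] on the number of independent iterations m."""
    return math.log(N) / math.log((J - 1) * (K - 1))


def cycles_bound(k, J, K):
    """The looser bound k / (1 - J/K) on m + 1, stated for k > 2."""
    return k / (1 - J / K)


if __name__ == "__main__":
    J, K = 4, 5
    for k in (3, 10, 1000):
        N_max = k / (1 - J / K)   # largest block length allowed by R >= 1 - J/K
        print(k, independent_iterations_bound(N_max, J, K) + 1, cycles_bound(k, J, K))
```

The gap between the two columns is large; the looser bound is used only because it leaves the final result as a simple power of k.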

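The probability that the initial memory failure occurs during the time
interval $0 \le t \le \mathcal{L}\tau$ is upper bounded by the sum of the probabilities
that either the initial memory failure or the initial propagation failure
occurs at $t = 0, \tau, 2\tau, \ldots, \mathcal{L}\tau$. This bound equals

$$\Pr[\text{failure during time interval } 0 \le t \le \mathcal{L}\tau] < (\mathcal{L} + 1) \cdot C \cdot k\left[\frac{1}{1 - J/K}\right] k^{-\beta + 1} = (\mathcal{L} + 1) \, C'' \, k^{-\beta'}$$

where

$$C'' \triangleq \frac{C}{1 - J/K} \quad \text{and} \quad \beta' \triangleq \beta - 2.$$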
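
To see how quickly this bound falls with k, the sketch below evaluates $(\mathcal{L} + 1) C'' k^{-\beta'}$ for the example values J = 14, K = 15, $\beta = 7.55$ quoted in the text. The constant C is derived in Appendix B and is not given numerically here, so C = 1 is a placeholder, as is the number of correcting cycles; the function name is illustrative.

```python
def failure_probability_bound(k, L, C, beta, J, K):
    """(L + 1) * C'' * k**(-beta'), with C'' = C / (1 - J/K) and beta' = beta - 2."""
    c_double_prime = C / (1.0 - J / K)
    beta_prime = beta - 2.0
    return (L + 1) * c_double_prime * k ** (-beta_prime)


if __name__ == "__main__":
    C = 1.0        # placeholder constant (assumption)
    L = 10 ** 6    # placeholder number of correcting cycles (assumption)
    for k in (10 ** 2, 10 ** 3, 10 ** 4):
        print(k, failure_probability_bound(k, L, C, beta=7.55, J=14, K=15))
```

For any fixed number of cycles, doubling k multiplies the bound by $2^{-\beta'}$, which is why a positive $\beta'$ is sufficient for stability.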

To show that there is a stable sequence of noisy memories, we must
show that it is possible to choose J, K, and $p_0$ such that $\beta' > 0$. Recall
that when we speak of a particular sequence of memories, $\{M_k\}$,
we mean that k is a variable whereas the values of J, K, and the
probabilities of component errors are all fixed. Certain conditions have
already been imposed on J, K and $p_0$. These conditions are:

(i) $J = 2l$, $l = 2, 3, 4, \ldots$
(ii) $K > J$
(iii) $p_0 > 2p_r + p_c$

(iv) $p_0 > \binom{J-1}{J/2} [(K - 1)(p_0 + p_a)]^{J/2} + p_d + p_r$

(») p„ > p„ 

where $p_a$, $p_d$, $p_r$ and $p_c$ must all be fixed and greater than zero. To
demonstrate that there are values of J, K and $p_0$ which satisfy these
conditions and for which $\beta' > 0$, consider an example where $p_a = p_d =
p_r = p_c =$ 10 _0 and where $p_0 = 10^{-8}$. For this example conditions (iii),
(iv), and (v) are satisfied for all reasonable values of J and K (that is,
where $J < K \ll p_0^{-1}$). The values of $\beta'(J, K, p_0 = 10^{-8})$, which correspond
to some typical values of J and K, are shown in Table I. For all
the values of J and K which are considered, the value of $\beta'$ is greater
than zero. Therefore, in all these cases, the probability of a memory
failure in $M_k$ at $t = \mathcal{L}\tau$ can be made arbitrarily small by making k
sufficiently large. This proves that there are stable sequences of noisy
memories.

3.4 Stability of Memories Constructed from Failure-Prone Components 

Thus far we have restricted our attention to memories in which 
component malfunctions are assumed to be statistically independent 
both from one component to another and from one use of a particular 
component to another use. These assumptions form the basis of a 
mathematical model for the component malfunctions that are com- 
monly attributed to "noise" in the system. Unfortunately, the model 
does not adequately represent the most common type of component 
malfunction that one finds in computing systems: malfunctions where 
individual components fail permanently. 

To see whether memories of the type considered in the previous 
section can be stable, if their components fail permanently, let us 
recall the proof of the stability theorem for noisy memories. In carry- 
ing out this proof, it is necessary to show that there are types of 
memories for which the probability of error per digit can be bounded 
(see Appendix A). If one attempts to find memories in which the 
components fail permanently and for which a similar bound on the
probability of error per digit exists, it becomes clear that memories
constructed from these components cannot have such a time-independent
bound of value less than 1/2. This is because the probability
that any particular component has failed increases with time given
that components fail permanently; hence, if one hypothesizes any
bound on the probability of error per digit which is less than 1/2, it is
always possible to find a time at which the hypothesized bound is
violated, showing that such a bound cannot exist. By using arguments
such as those in Section 2.3, one can obtain a statistical model
for the errors after each correcting cycle in terms of an equivalent 
channel whose capacity decreases with time. Since the capacity de- 
creases, it is always possible to find a time at which the capacity of 
the equivalent channel is below the information rate of the code used 
in storing the information in the memory, thus precluding the pos- 
sibility of effective error correction and hence the possibility of 
stability. 

Fortunately, in most "nonspace" applications, regular maintenance 
is performed on computing systems, that is, components which have 
failed are periodically replaced with good ones. Numerous specific 
failure probability distributions and maintenance schemes could be 
considered individually; however, for the purpose of this analysis 
we consider instead a general case which includes many of the com- 
mon probability distributions and maintenance schemes. This gen- 
eral case is the one for which it is possible to upper bound, during 
each correcting cycle, the probability that each component has failed 
up to or during that correcting cycle. For example, suppose that each
component is replaced every T seconds and that $p_f$ represents the
probability that any particular component initially fails during any
particular correcting cycle. For this example, the desired upper bound
on the probability of component failure equals $T \cdot p_f$ which, by appropriate
choice of T and $p_f$, can be made less than 1/2. Notice that
we are free to choose both T and $p_f$ since, as before, we are only trying
to show that there exists some memory of the type under consideration
which is stable.
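
The bound $T \cdot p_f$ is a union bound over the correcting cycles since the last replacement. The sketch below compares it with the exact probability under two simplifying assumptions that are not stated in the text: the replacement interval T is counted in correcting cycles, and a working component fails in each cycle independently with probability $p_f$. The function names and numerical values are illustrative.

```python
def failure_probability(p_f, T):
    """Probability that a component has failed within T cycles of its replacement."""
    return 1.0 - (1.0 - p_f) ** T


def union_bound(p_f, T):
    """The bound T * p_f used in the text."""
    return T * p_f


if __name__ == "__main__":
    p_f = 1e-6  # assumed per-cycle failure probability
    for T in (10, 1000, 100000):
        print(T, failure_probability(p_f, T), union_bound(p_f, T))
```

Shortening the replacement interval lowers both quantities, which is how a maintenance schedule keeps the per-component failure probability below the level the stability proof requires.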

One can now perform an analysis identical to that performed in 
the previous section. Since the technique used to prove the stability 
theorem for noisy memories does not rely upon the assumption that 
component errors are statistically independent from use to use, pre- 
cisely the same technique used previously can be used here to prove 
that periodically maintained memories can be stable. In fact, in most 
cases, the changes required to make this proof apply when components
fail permanently merely involve replacing the words "component
error" with the words "component failure."

One change which requires some reinterpretation of terms involves 
the concept of a propagation failure. This concept was introduced for 
the purpose of establishing a condition under which the parity checks 
used to estimate any particular digit would be conditionally inde- 
pendent. To facilitate an intuitive discussion of propagation failures, 
the original definition of a propagation failure was made more gen- 
eral than necessary for the mathematical analysis. This analysis, in 
Appendix B, uses the fact that undesirable statistical dependencies 
can occur only if the effects of previous component malfunctions form 
a A propagation path in some parity check set tree. It is now desirable 
to redefine a propagation failure in terms of the formation of such a 
A propagation path. This is because permanent component failures 
can result in recurrent errors in a particular digit during m or more 
iterations thus causing a propagation failure, according to the original 
definition. However, since these recurrent errors do not lead to the 
undesirable statistical dependencies unless they also correspond to an 
undesirable A propagation path, the original definition of a propaga- 
tion failure should be changed to exclude recurrent errors. 

Once these changes have been made, if one represents the bounds on 
the probabilities of component failures by the same symbols that 
were used previously to represent the actual probabilities of com- 
ponent errors during each iteration, not only is the method of prov- 
ing the stability theorem identical to that used previously but so 
are the forms of all the results. Thus one proves the following theorem. 

Theorem 2: There is a stable sequence of memories where every com- 
ponent in each of the memories in the sequence has a nonzero prob- 
ability of permanent failure but where components which have failed 
are periodically replaced with good ones. 

IV. CONCLUSIONS 

In Section I we compare the results obtained by Shannon con- 
cerning the reliability of communication systems with the results ob- 
tained by von Neumann, Elias, Winograd, and Cowan concerning the 
reliability of computing systems. Shannon's results were basically 
different from the other results considered. Shannon was able to show 
that it is possible to design arbitrarily reliable communication sys- 
tems through which information can be transmitted at a nonzero 
information rate. The maximum rate for which the probability of
error can be made arbitrarily small is called the capacity of the com- 
munication channel. In analogy with this result, one might expect 
that it should be possible to design arbitrarily reliable computing 
systems which have a bounded redundancy. Unfortunately, none of 
the computing systems proposed previously has this property; there- 
fore, none of these computing systems has a nonzero "computing 
capacity." 

In Sections II and III we restrict our attention to one part of a 
computing system, namely the memory. It is shown that there are 
noisy memories and periodically repaired memories constructed from 
failure-prone components which have the property that the prob- 
ability of failure can be made arbitrarily small for certain bounded 
values of the redundancy. The memories which have this property 
are called "stable memories." This result is analogous to Shannon's 
result and is basically different from the other results obtained thus 
far concerning the reliability of computing systems. The fact that it 
is possible to make a memory arbitrarily reliable while keeping its 
redundancy bounded indicates that a memory has an "information 
storage capacity" analogous to the capacity of a communication 
channel. The information storage capacity of a particular memory 
equals the reciprocal of the minimum redundancy for which the 
memory is stable; hence it can be expressed in bits per component. It 
is a function of the probabilities of error for the components within 
the memory. The method used to prove the stability theorem does 
not allow one to compute an explicit value for the information storage 
capacity of the memories which were considered; however, the fact 
that these memories can be stable indicates that they do have a non- 
zero information storage capacity. 

APPENDIX A

A Bound on the Probability of Error Per Digit 

The first step in computing a bound on the probability of a
propagation failure is to bound the probability of error for any digit
stored in the registers within the memory. We assume that initially
one set of J code words is transmitted across the noisy channels and
inserted into these noisy registers. Some time during the next τ
seconds the correcting network reads the contents of the registers and
starts to perform the first correcting cycle on the newly inserted
digits. The time at which this first correcting cycle starts is
denoted by t = 0, and successive correcting cycles start at
t = τ, 2τ, 3τ, .... We denote the instant just before the end of the
first correcting cycle by t = τ - δ. If a digit is in error at
t = τ - δ, at least one of the following events must have occurred:

(i) An error was made in transmitting the digit across the BSC. The
probability of this event is p_c.

(ii) The digit was changed because of a component error in the register
which occurred during either the time interval -τ < t ≤ 0 or the
interval 0 < t ≤ τ - δ. The probability of this event is less than 2p_r.

Therefore, by the union bound, the probability of error per digit at
t = τ - δ is bounded by

$$\Pr[\text{digit} = e \text{ at } t = \tau - \delta] < 2p_r + p_c < p_0$$

where the parameter p_0 has been introduced to simplify the form of
the results. Other conditions on p_0 will be imposed later.

Next let us compute a bound on the probability of error per digit
at t = 2τ - δ, 3τ - δ, .... If the digit d_0 is in error at any one of
these times, at least one of the following events must have occurred:

(i) A set of J/2 parity checks used to estimate d_0 were in error
during the last correcting cycle performed on d_0.

(ii) The decision device made an error during the last correcting
cycle.

(iii) An error occurred while d_0 was stored in the register.

There are $\binom{J-1}{J/2}$ possible events of the first type. We now
compute the probability that any one of these events occurred. If the
ith parity check used to estimate d_0, c_i, were in error there must
have been at least one error among the K - 1 adders used to evaluate
this parity check, or at least one error among the K - 1 digits denoted
by d_{i1}, d_{i2}, ..., d_{i,K-1}. According to the union bound, the
probability that c_i (for any 0 < i ≤ J - 1) is in error is bounded by

$$\Pr[c_i = e] < \sum_{j=1}^{K-1} \Pr[d_{ij} = e] + (K-1)\,p_a .$$

During the first correcting cycle Pr[d_{ij} = e] < p_0 for all
0 < j ≤ K - 1; therefore, for this iteration

$$\Pr[c_i = e \text{ during first correcting cycle}] < (K-1)\,p_0 + (K-1)\,p_a .$$

To compute the probability of error per digit just before the end of 
the second iteration, we use the fact that, during the first m iterations, 
the structure of the code guarantees that errors in the parity checks used 
to estimate any particular digit are statistically independent. To see 
this, consider the parity-check set tree rising from the digit d_0 as shown
in Fig. 6. Each node on this tree represents a particular digit. With each 
digit there is associated a set of components used to compute each 
estimate of the digit. Just as the tree represents a history of the digits 
which have been involved in the computation of the successive estimates 
of d_0, it also represents a history of the components which have been
involved in the computation of these estimates. The latter interpretation
is more useful for our purposes since it is the components which cause 
the errors. If the code has m independent iterations, all the digits on the 
first m tiers of the tree must be different and all the components
associated with these digits must be different also. There are
(K - 1)(J - 1) digits on the first tier of this tree. Provided that
m ≥ 1, the errors in
these digits after the first iteration must be statistically independent 
since the digits are all different, and hence the components used to 
compute the estimates of these digits are all different. (It is assumed that 
errors in different components are statistically independent.) In general, 
the errors in the digits on the first tier of the tree, and hence the J — 1 
parity checks, are statistically independent provided that the sets of 
components used to compute the estimates of these digits are disjoint. 
The structure of the tree guarantees that this condition will be satisfied 
for the first m iterations. 

Since, during the first m iterations, the errors in the parity checks
used to estimate any particular digit are statistically independent,
the probability that a set of J/2 parity checks is in error equals the
product of the probabilities that each one of these J/2 parity checks
is in error. Thus the probability of error per digit at t = 2τ - δ is
bounded by

$$\Pr[\text{digit} = e \text{ at } t = 2\tau - \delta] < \binom{J-1}{J/2}\bigl[(K-1)(p_0 + p_a)\bigr]^{J/2} + p_d + p_r \triangleq p_1 .$$

Since we are not attempting to show that all memories of the type
under consideration are stable but merely that there exist some
memories of this type which are stable, we shall restrict our interest
to those memories for which it is possible to make p_1 < p_0. This is
not a serious restriction since in most cases there is no difficulty in
bounding p_1 by p_0. For example, if p_a, p_r and p_d = 10^-9,
p_0 = 10^-8, J = 14 and K = 15, then p_1 ≈ 2·10^-9, illustrating one
case where p_1 < p_0.
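
The arithmetic in this example can be checked with a short Python
sketch. It assumes that the three component error probabilities are each
10^-9 and that p_0 = 10^-8, as read from the example, and that J is even;
the function name `first_cycle_bound` is hypothetical and not part of the
paper.

```python
from math import comb

def first_cycle_bound(J, K, p0, p_a, p_d, p_r):
    """Bound p_1 on the per-digit error probability at t = 2*tau - delta.

    Evaluates C(J-1, J/2) * [(K-1)(p0 + p_a)]**(J/2) + p_d + p_r,
    assuming J is even.
    """
    majority_term = comb(J - 1, J // 2) * ((K - 1) * (p0 + p_a)) ** (J // 2)
    return majority_term + p_d + p_r

p1 = first_cycle_bound(J=14, K=15, p0=1e-8, p_a=1e-9, p_d=1e-9, p_r=1e-9)
print(p1)  # about 2e-9, dominated by p_d + p_r
```

For these values the parity-check term is many orders of magnitude
smaller than p_d + p_r, so p_1 is essentially the sum of the decision
device and register error probabilities.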

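Precisely the same argument can be used to obtain a bound on the
probability of error per digit at t = 3τ - δ, 4τ - δ, ..., (m+1)τ - δ.
The results are:

$$\Pr[\text{digit} = e \text{ at } t = 3\tau - \delta] < \binom{J-1}{J/2}\bigl[(K-1)(p_1 + p_a)\bigr]^{J/2} + p_d + p_r \triangleq p_2 < p_1 < p_0 .$$

Similarly

$$\Pr[\text{digit} = e \text{ at } t = (m+1)\tau - \delta] < \binom{J-1}{J/2}\bigl[(K-1)(p_{m-1} + p_a)\bigr]^{J/2} + p_d + p_r \triangleq p_m < p_{m-1} < \cdots < p_1 < p_0 .$$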

If no propagation failure occurs at t = mτ, the error pattern
evaluated at that time depends on the component errors that occurred
during the previous m iterations, but not on the original errors that
were present at t = 0. Imposing this condition changes the probability
of error per digit at t = (m+1)τ - δ; however, as we now show, the
probability computed above is an upper bound on this conditional
probability. Using Bayes' rule,

$$\Pr[\text{digit} = e \mid \text{no propagation failure}] = \frac{\Pr[\text{no propagation failure} \mid \text{digit} = e]\;\Pr[\text{digit} = e]}{\Pr[\text{no propagation failure}]} .$$

It is easily shown, using techniques similar to those in Appendix B,
that

$$\Pr[\text{no propagation failure} \mid \text{digit} = e] \le \Pr[\text{no propagation failure}],$$

therefore

$$\Pr[\text{digit} = e \mid \text{no propagation failure}] \le \Pr[\text{digit} = e].$$

If no propagation failure has occurred, the errors in the set of
parity checks used to estimate any particular digit are conditionally
independent. This is because, if there is no propagation failure, the
errors in each set of parity checks depend on errors in unrelated,
disjoint sets of components. Thus the technique used to bound the
probability of error per digit during the first m iterations can be
applied after the mth iteration provided that no propagation failure
has occurred. The general result is

$$\Pr[\text{digit} = e \text{ at } t = (i+1)\tau - \delta \mid \text{no propagation failure}] < \binom{J-1}{J/2}\bigl[(K-1)(p_{i-1} + p_a)\bigr]^{J/2} + p_d + p_r \triangleq p_i < p_{i-1} < \cdots < p_1 < p_0 .$$

This monotonically decreasing sequence of bounds shows that the
probability of error per digit is upper bounded by p_0 provided that
it is possible to choose p_0, J, and K such that p_1 < p_0 and provided
that no propagation failure occurs. In general, the probability of
error per digit is upper bounded by p_i provided that the memory has
performed i or more correcting cycles and provided that the conditions
stated above are satisfied.
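
As a rough check on the monotone sequence of bounds, the Python sketch
below iterates the recursion p_i = C(J-1, J/2)[(K-1)(p_{i-1} + p_a)]^(J/2)
+ p_d + p_r. It assumes the example values J = 14, K = 15, p_0 = 10^-8
and component error probabilities of 10^-9, and it uses exact rational
arithmetic because successive bounds differ by far less than
double-precision resolution. The function name `digit_error_bounds` and
the choice of three cycles are hypothetical.

```python
from fractions import Fraction
from math import comb

def digit_error_bounds(J, K, p0, p_a, p_d, p_r, cycles):
    """Return [p_0, p_1, ..., p_cycles] as exact fractions (J assumed even)."""
    bounds = [p0]
    for _ in range(cycles):
        prev = bounds[-1]
        parity_term = comb(J - 1, J // 2) * ((K - 1) * (prev + p_a)) ** (J // 2)
        bounds.append(parity_term + p_d + p_r)
    return bounds

component = Fraction(1, 10**9)  # assumed value of p_a, p_d and p_r
bounds = digit_error_bounds(14, 15, Fraction(1, 10**8),
                            component, component, component, cycles=3)

# Strict decrease p_3 < p_2 < p_1 < p_0, checked without rounding error.
is_decreasing = all(later < earlier for earlier, later in zip(bounds, bounds[1:]))
print(is_decreasing, [float(b) for b in bounds])
```

Because each bound is an increasing function of the previous one, the
single condition p_1 < p_0 is enough to make every later bound smaller
than its predecessor.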

APPENDIX B 

The Probability of a Propagation Failure 

Intuitively, the concept of a propagation failure is very simple. A
propagation failure occurs whenever the present error pattern in the
registers is in some way related to the component errors that occurred
more than m iterations before, where m is the number of independent
iterations. This intuitive concept can be made more precise by defining
"error configurations" which are hypothetical error patterns that are
functions of some subset of the actual set of component errors. In
particular, a type i error configuration, for any i > 0, is a
J·N-tuple corresponding to the error pattern that would be present if
there had been no component errors before the last i iterations. If no
propagation failure has occurred, all error configurations of type m
and higher must be identical; thus we are really interested in the
difference between error configurations. For this reason, it is more
convenient to restate the definition of a propagation failure in terms
of "Δ configurations" where, for all i > 0, the type i Δ configuration
is the difference between the type i and type i+1 error configurations.
The type 0 Δ configuration is defined to be equal to the type 1 error
configuration. A propagation failure occurs if there are one or more
1's in any Δ configuration of type m or higher.
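
To make this bookkeeping concrete, the following Python sketch forms
the Δ configurations from a list of error configurations and tests for a
propagation failure. It assumes the digits are binary, so that the
"difference" between two configurations can be taken as component-wise
modulo-2 addition (XOR); the function names and the four-digit example
patterns are hypothetical.

```python
def delta_configurations(error_configs):
    """Map type number -> delta configuration.

    error_configs[i - 1] holds the type-i error configuration (i = 1, 2, ...),
    each given as a tuple of 0/1 entries of equal length.
    """
    # Type 0 is defined to equal the type 1 error configuration.
    deltas = {0: tuple(error_configs[0])}
    for i in range(1, len(error_configs)):
        type_i, type_next = error_configs[i - 1], error_configs[i]
        deltas[i] = tuple(a ^ b for a, b in zip(type_i, type_next))
    return deltas

def has_propagation_failure(deltas, m):
    """True if any delta configuration of type m or higher contains a 1."""
    return any(any(bits) for i, bits in deltas.items() if i >= m)

# Types 2, 3 and 4 agree, so with m = 2 there is no propagation failure.
stable = [(0, 1, 0, 0), (0, 1, 1, 0), (0, 1, 1, 0), (0, 1, 1, 0)]
# Type 4 differs from type 3, which puts a 1 in the type 3 delta configuration.
failing = [(0, 1, 0, 0), (0, 1, 1, 0), (0, 1, 1, 0), (1, 1, 1, 0)]

print(has_propagation_failure(delta_configurations(stable), m=2))   # False
print(has_propagation_failure(delta_configurations(failing), m=2))  # True
```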

Let us consider the situations which, for any i, lead to a 1 in the
type i Δ configuration evaluated at some particular time, say t = Lτ.
Suppose that there is a 1 in the position corresponding to the digit
d_0 in this Δ configuration. This means that the value of the digit d_0
in the type i error configuration is different from that in the type
i+1 error configuration where both configurations are evaluated at
t = Lτ. This change in the value of d_0 must be related to component
errors that occurred during the time interval (L-i-1)τ < t < (L-i)τ
since all the other component errors upon which the type i and type
i+1 error configurations are based are the same. We refer to the
component errors that occurred during the time interval
(L-i-1)τ < t < (L-i)τ as the controlling errors for the type i
Δ configuration evaluated at t = Lτ and we say that the value of d_0 at
t = Lτ has been changed by these controlling errors.

If the value of d_0 is changed at t = Lτ, the controlling errors must
have changed at least one of the digits used to estimate d_0. These are
the digits on the first tier of the parity-check set tree rising from
d_0, and changes in them are represented by 1's in the appropriate
positions in the type i-1 Δ configuration evaluated at t = (L-1)τ. In
general these controlling errors must have caused changes in some
digits on the ℓth tier of the parity-check set tree, the changes being
represented by 1's in the appropriate positions in the type i-ℓ
Δ configuration evaluated at t = (L-ℓ)τ. These changed digits define at
least one continuous path in the parity-check set tree rising i tiers
from d_0. We call these paths Δ propagation paths. The particular
Δ propagation path which has just been described is referred to as the
i-tier Δ propagation path rising from d_0 at t = Lτ (see Fig. 8). Every
1 in a type i Δ configuration must have at least one i-tier
Δ propagation path associated with it. To bound the probability that
there is a 1 in the entry corresponding to the digit d_0 in the type i
Δ configuration evaluated at t = Lτ, we shall bound the probability
that one or more i-tier Δ propagation paths rise from the digit d_0 at
t = Lτ.

Since Δ propagation paths must be continuous, for each 1 in a type i
Δ configuration evaluated at t = Lτ there must have been at least one 1
in a type i-1 Δ configuration evaluated at t = (L-1)τ. If no
propagation failure occurred before t = Lτ, the Δ configurations of
type m and higher must have been all zero for all t < Lτ. This implies
that the Δ configurations of type m+1 and higher must be all zero at
t = Lτ. Hence, the only way that the initial propagation failure can
occur at t = Lτ is if there is a 1 in the type m Δ configuration
evaluated at that time. Therefore, to bound the probability that the
first propagation failure occurs at some particular time, we need only
bound the probability of one or more 1's in the type m Δ configuration
evaluated at that time. Since we assume that the memory fails whenever
the first propagation failure occurs, we shall never be concerned with
computing the probability of a 1 in any Δ configuration of type m+1 or
higher. It is important that we can restrict our attention to the type
m Δ configuration since this Δ configuration can be computed for any
particular time by considering only the component errors that occurred
during the previous m+1 correcting cycles. The first of these
correcting cycles results in statistically independent digit errors
and, as explained in Appendix A, the structure of the code guarantees
that during the next m correcting cycles the errors in the parity
checks used to estimate any digit are statistically independent.

Fig. 8. An example of an i-tier Δ propagation path rising from d_0 at
t = Lτ.

To bound the probability that the initial propagation failure occurs
at t = Lτ, we must compute a bound on the probability of one or more
1's in the type m Δ configuration evaluated at t = Lτ. This is done by
bounding the probability that an m-tier Δ propagation path rises from
one or more of the J·N digits in the registers within the memory at
t = Lτ. At first we restrict our attention to one particular digit d_0.
A bound is computed on the probability that an m-tier Δ propagation
path rises from d_0 at t = Lτ. The first step in this computation is to
bound the probability that the component errors that occurred during
the time interval (L-m-1)τ < t < (L-m)τ would cause any particular
digit on the mth tier of the parity-check set tree rising from d_0 to
be in error at t = (L-m)τ. Any m-tier Δ propagation path rising from
d_0 at t = Lτ must terminate on one of these errors, which we call new
errors. The next step is to bound the probability that at
t = (L-m+i)τ an i-tier Δ propagation path rises from any particular
digit on the (m-i)th tier of the parity-check set tree rising from d_0.
This probability is denoted by Pr[d_{m-i} = Δ_i]. By substituting m for
i, we obtain a bound on the probability that at t = Lτ an m-tier
Δ propagation path rises from the digit d_0 (that is, Pr[d_0 = Δ_m]).
The probability that an m-tier Δ propagation path rises from one or
more of the J·N digits at t = Lτ is upper bounded by
J·N·Pr[d_0 = Δ_m].

The first step, namely bounding the probability of a new error at
t = (L-m)τ, is particularly simple. In Appendix A we computed a
bound on the probability of error per digit which was denoted by p_0.
Since this bound was computed by considering all possible errors that
could exist at a particular time, it must certainly be a bound on the
probability of a new error at some particular time. Therefore, p_0 is a
bound on the probability that a new error occurs at t = (L-m)τ.

The next step is to bound Pr[d_{m-i} = Δ_i] for all 1 ≤ i ≤ m. Let us
consider a particular digit on the (m-i)th tier of the parity-check set
tree rising from d_0 and denote this digit by d_{m-i}. Now consider the
conditions which must be satisfied if the value of d_{m-i} is changed by
the controlling errors; that is, if an i-tier Δ propagation path rises
from d_{m-i} at t = (L-m+i)τ.

To describe this "change" in more detail, we must consider two sets of
component errors. One is the set of component errors that occurred
since t = (L-m-1)τ and the other is the set of component errors that
occurred since t = (L-m)τ [assume that no component errors occurred
before t = (L-m-1)τ and t = (L-m)τ, respectively]. When we say that
the value of d_{m-i} has been changed by the controlling errors, we
mean that the value of d_{m-i} at t = (L-m+i)τ is correct when it is
computed under the assumption that one of these sets of component
errors actually occurred, whereas it is incorrect when it is computed
under the assumption that the other set of errors actually occurred.
There are two necessary conditions for this change. Assume that the
value of d_{m-i} was changed by the controlling errors. Denote the set
of component errors for which d_{m-i} = e by S_e and the set for which
d_{m-i} ≠ e by S_correct. If the value of d_{m-i} is changed by the
controlling errors, both of the following conditions must be satisfied:

(i) There must have been at least one parity check used to estimate
d_{m-i} which was wrong [at t = (L-m+i)τ] under the assumption
that S_e occurred but which was correct under the assumption that
S_correct occurred. Denote one of these parity checks by c_Δ.

(ii) On the basis of the errors in the set S_e, J/2 - 1 or more parity
checks other than c_Δ must have been wrong at t = (L-m+i)τ.

In order for condition (i) to be satisfied, the value of at least one
of the (J-1)(K-1) digits immediately above d_{m-i} in the parity-check
set tree must have been changed [at t = (L-m+i-1)τ] by the controlling
errors. The probability of such a change has been denoted by
Pr[d_{m-i+1} = Δ_{i-1}]. Therefore, the probability that the value of
one or more of these digits was changed is upper bounded by

$$\Pr[\text{condition (i) is satisfied}] < \Pr[\text{value of any digit immediately above } d_{m-i} \text{ is changed by controlling errors}] < (J-1)(K-1)\,\Pr[d_{m-i+1} = \Delta_{i-1}] .$$
A bound on the probability that condition (ii) is satisfied was derived
in Appendix A. This bound equals

$$\Pr[\text{condition (ii) is satisfied}] < \binom{J-2}{J/2-1}\bigl[(K-1)(p_0 + p_a)\bigr]^{J/2-1} .$$

As explained previously, the structure of the code guarantees that
parity-check errors are independent; hence, during the m iterations
of interest, these two conditions are independent. Therefore, the
probability that both conditions are satisfied, which is a bound on
Pr[d_{m-i} = Δ_i], is given by

$$\Pr[d_{m-i} = \Delta_i] < (J-1)(K-1)\,\Pr[d_{m-i+1} = \Delta_{i-1}]\binom{J-2}{J/2-1}\bigl[(K-1)(p_0 + p_a)\bigr]^{J/2-1} .$$

Substituting i = 1, 2, ..., m, we obtain

$$\begin{aligned}
\Pr[d_{m-1} = \Delta_1] &< (J-1)(K-1)\,p_0\binom{J-2}{J/2-1}\bigl[(K-1)(p_0 + p_a)\bigr]^{J/2-1} \\
\Pr[d_{m-2} = \Delta_2] &< (J-1)(K-1)\,\Pr[d_{m-1} = \Delta_1]\binom{J-2}{J/2-1}\bigl[(K-1)(p_0 + p_a)\bigr]^{J/2-1} \\
&< p_0\left\{(J-1)(K-1)\binom{J-2}{J/2-1}\bigl[(K-1)(p_0 + p_a)\bigr]^{J/2-1}\right\}^{2} \\
&\;\;\vdots \\
\Pr[d_0 = \Delta_m] &< p_0\left\{(J-1)(K-1)\binom{J-2}{J/2-1}\bigl[(K-1)(p_0 + p_a)\bigr]^{J/2-1}\right\}^{m} .
\end{aligned}$$
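
The recursion can also be unrolled numerically. The Python sketch below
assumes that the probability of a new error at the top of the path is
bounded by p_0 and that J is even, and multiplies in one per-tier factor
(J-1)(K-1) C(J-2, J/2-1) [(K-1)(p_0 + p_a)]^(J/2-1) for each of the m
tiers. The function names and the choice m = 3 are hypothetical; the
parameter values are borrowed from the example in Appendix A.

```python
from math import comb, isclose

def tier_factor(J, K, p0, p_a):
    """Per-tier factor (J-1)(K-1) * C(J-2, J/2-1) * [(K-1)(p0+p_a)]**(J/2-1)."""
    paths = (J - 1) * (K - 1)                      # digits one tier up
    subsets = comb(J - 2, J // 2 - 1)              # choices of other wrong checks
    check_error = ((K - 1) * (p0 + p_a)) ** (J // 2 - 1)
    return paths * subsets * check_error

def path_bound(J, K, p0, p_a, m):
    """Bound on Pr[d_0 = Delta_m], applying the recursion tier by tier."""
    bound = p0  # bound on a new error at the top of the path
    for _ in range(m):
        bound *= tier_factor(J, K, p0, p_a)
    return bound

factor = tier_factor(14, 15, 1e-8, 1e-9)
bound = path_bound(14, 15, 1e-8, 1e-9, m=3)
print(factor, bound)
```

The bound shrinks geometrically in m only when the per-tier factor is
below 1, which is the case for these values by a very wide margin.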

Gallager has found a technique for constructing low-density
parity-check codes¹⁰ with m, the number of independent iterations,
bounded by

$$\frac{\log\left[\dfrac{N}{2K} - \dfrac{N}{2J(K-1)}\right]}{2\log[(J-1)(K-1)]} \le m \le \frac{\text{[numerator illegible in source]}}{\log[(J-1)(K-1)]} .$$
Substituting this lower bound into the equation for Pr[d_0 = Δ_m] gives

$$\Pr[d_0 = \Delta_m] < p_0\left\{(J-1)(K-1)\binom{J-2}{J/2-1}\bigl[(K-1)(2p_0)\bigr]^{J/2-1}\right\}^{\frac{\log[N/2K - N/2J(K-1)]}{2\log[(J-1)(K-1)]}} = p_0\left[\frac{N}{2K} - \frac{N}{2J(K-1)}\right]^{-\beta}$$

where

$$\beta = \frac{-\log\left\{(J-1)(K-1)\dbinom{J-2}{J/2-1}\bigl[(K-1)(2p_0)\bigr]^{J/2-1}\right\}}{2\log[(J-1)(K-1)]} .$$

We have assumed that p_0 has been chosen such that p_0 > p_a. For any
L, the probability that the initial propagation failure occurs at
t = Lτ is bounded by

$$\Pr[\text{initial propagation failure occurs at } t = L\tau]
\begin{cases}
= 0 & \text{for } L < m \\
< J N \,\Pr[d_0 = \Delta_m] & \text{for } L \ge m
\end{cases}$$

since, by definition, a propagation failure cannot occur before t = mτ.
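We have chosen to number the memories according to their information
storage capability, k. We can express this result in terms of k
by noting that N = k/R and 1 - J/K ≤ R ≤ 1; therefore,

$$\Pr[\text{initial propagation failure occurs at } t = L\tau]
\begin{cases}
= 0 & \text{for } L < m \\
< J\,\dfrac{k}{1 - J/K}\; p_0 \left[\dfrac{k}{2K} - \dfrac{k}{2J(K-1)}\right]^{-\beta} = C\,k^{-(\beta-1)}\,p_0 & \text{for } L \ge m
\end{cases}$$

where

$$C \triangleq \frac{J}{1 - J/K}\left[\frac{1}{2K} - \frac{1}{2J(K-1)}\right]^{-\beta} .$$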

Both C and β are functions of J, K and p_0. For any particular
sequence of memories, J, K and p_0 will all be constants. For example,
if J = 14, K = 15, and p_0 = 10^-8, then β = 7.55; therefore, in this
case, the probability that the first propagation failure occurs at
t = Lτ (for any L) is bounded by a function that decreases as the
inverse sixth power of the information storage capability of the
memory (more precisely, as k^-(β-1) = k^-6.55). By choosing the
information storage capability sufficiently large, this probability can
be made arbitrarily small.
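
The quoted value of β can be reproduced approximately with the Python
sketch below. It assumes that β is the negative logarithm of
(J-1)(K-1) C(J-2, J/2-1) [(K-1)(2p_0)]^(J/2-1) divided by
2 log[(J-1)(K-1)], with J even; the function name `beta_exponent` is
hypothetical.

```python
from math import comb, log

def beta_exponent(J, K, p0):
    """beta = -log{(J-1)(K-1) C(J-2, J/2-1) [(K-1)(2 p0)]**(J/2-1)}
              / (2 log[(J-1)(K-1)]), assuming J is even."""
    branching = (J - 1) * (K - 1)
    factor = branching * comb(J - 2, J // 2 - 1) * ((K - 1) * 2 * p0) ** (J // 2 - 1)
    return -log(factor) / (2 * log(branching))

beta = beta_exponent(J=14, K=15, p0=1e-8)
print(round(beta, 2))      # close to the 7.55 quoted in the text
print(round(beta - 1, 2))  # exponent of 1/k in the failure bound
```

Under these assumptions the sketch gives a value of about 7.54, in line
with the quoted 7.55, so the bound on the failure probability falls off
slightly faster than the inverse sixth power of k.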